Enumeration of RNA structures by Matrix Models 
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We enumerate the number of RNA contact structures according to their genus, i.e. the topological 
character of their pseudoknots. By using a recently proposed matrix model formulation for the RNA 
folding problem, we obtain exact results for the simple case of an RNA molecule with an infinitely 
flexible backbone, in which any arbitrary pair of bases is allowed. We analyze the distribution of 
the genus of pseudoknots as a function of the total number of nucleotides along the phosphate-sugar 
backbone. 



The prediction of foldings of single-stranded nucleic 
acids (like RNA molecules) is still a major open problem 
of molecular biology P|. Several methods are available 
for the prediction and description of the folding process in 
various conditions. Most of them are statistical models 
(both at equilibrium and out-of equilibrium) that have 
roots in combinatorial problems. Although these models 
are much simpler than the energy based ones (and thus 
cannot provide thermodynamical predictions), they often 
provide exact analytical solutions that give important in- 
sights on the phase-space structure and the entropy. For 
those reasons the combinatorics of contact structures of 
biopolymers has received great attention over the past 
thirty years [3- In the case of RNA- folding, a lot of 
attention has been paid to the combinatorics of contact 
structures that are planar (see e.g. 's'l or 'i'l and references 
therein) , but very little is known about non-planar struc- 
tures (i.e. structures with pseudoknots). In this Letter we 
explore a very schematic model for RNA folding which 
allows for the exact enumeration of all contact structures 
with fixed genus. This model, which is based on a sim- 
pler one that was proposed earlier in H, IE S 13 ; may be 
relevant for studying the behaviour of non-planar contri- 
butions. The partition function is that of a chain of L 
nucleotides in three dimensions: 
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where is the position vector in three dimensions of 
the fc-th base, and /({r}) is a function which takes into 
account the geometry, the stiffness and the sterical con- 
straints of the chain. The folding of the chain is caused 
by the hydrogen bonds that the bases can form. Since 
the hydrogen bonds saturate, a base can interact with 
only one other base at a time. The contribution from 
such interactions to the partition function is described 
by^L({r}): 
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- ^»i(i"»j)I4;(r/ci) + - 

i<j<k<l 



where Vij(vij) — eyiY>{—(3eijVij{vij)) is the Boltzmann 
factor associated with the energy Sij of making a bond 



between the ith and the jth base at distance r^. In 
this expression, (3 = 1/T denotes the inverse tempera- 
ture, and Vij{Yij) represents the (short range) space de- 
pendent part of the interaction. To further simplify the 
model, we will assume that the chain is infinitely flexible 
and we will neglect all sterical constraints, so that any 
pairing of bases is assumed to be feasible. Therefore, we 
can neglect all spatial degrees of freedom and write 
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where now Vij = exp(— /Je^ ). As shown in each term 
in Zl can be represented graphically by a suitable arc 
diagram. In such a representation the nucleotides are 
dots on an oriented horizontal line (which represents the 
RNA sugar backbone from the 5' end to the 3' end), and 
each base pair is drawn as an arc - above that line - be- 
tween the two interacting bases. In real RNA, not all 
pairs of nucleotides can interact. For instance, two bases 
which are too close to each other along the backbone (say 
within a distance of 4 bases) cannot form a hydrogen 
bond since the backbone is not flexible enough. More- 
over, for an RNA molecule one also usually assumes that 
only standard Watson-Crick pairs (A-U,C-G) and wob- 
ble pairs (G-U) are possible. These constraints greatly 
increase the difficulty of enumerating all possible struc- 
tures that are allowed. Among the set of all possible 
structures, one defines secondary structures of an RNA 
molecule as all structures which are represented by planar 
arc diagrams (no crossing of arcs). When the diagrams 
are non planar, one says that the RNA molecule contains 
one or more pseudoknots. Structures with pseudoknots 
can be classified according to the topological character of 
the corresponding arc diagram Such a classification 
can be made more explicit directly in eq. (0) , as explained 
in 5]. The main idea of j^l] is to consider the following 
integral over matrices; 
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Here ipi, i = \, . . . , L, are L independent N x N Hermi- 
tian matrices {ip^ = fi) and Hz^ill + ^he ordered 
matrix product (1 + ipi){l + (^2) • • • (1 + V'l)- The nor- 
malization factor is: 

= / n rfy'fee-^^-t^")-''-'^'^^) , (4) 
fe=i 

and V is the L x L symmetric matrix with elements Vij . 
The integral in eq. Q can be evaluated by using the 
Wick theorem. The result is a function of N which can 
be written as an asymptotic series at large N: 

Zl{N) = l + ^^J^ki 
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The relation with the expansion in eq. I^l is obvious. The 
two series coincide for = 1, whereas for > 1 the se- 
ries in eq. lO contains topological information. All the 
planar structures are given by the 0(1) term of eq. © 
and higher-order terms in 1 /N'^ correspond to RNA sec- 
ondary structures with pseudoknots. The classification 
of pseudoknots induced by this expansion is reviewed in 
i 

The most challenging problem in RNA-folding predic- 
tion is to find the structure with the lowest free energy. 
If one restricts the search to the set of secondary struc- 
tures without pseudoknots, several fast algorithms are 
available 0. However, when one includes the possi- 
bility of having pseudoknots, the problem is still open. 
An even simpler fundamental problem, namely the exact 
combinatorics of RNA structures with any pseudoknots, 
is unsolved. Results about the combinatorics of RNA 
secondary structures without pseudoknots or with very 
special classes of pseudoknots are available (e.g. 0,01); 
but the general case is still lacking. In this Letter we ad- 
dress precisely the problem of enumerating all secondary 
structures with pseudoknots. 

In order to get exact results, we make a few additional 
simplifications. We assume that any possible pairing be- 
tween nucleotides is allowed (independently of the type 
of nucleotides and from their distance along the chain) 
and that all the pairings may occur with the same prob- 
ability. In other words, we assume that the matrix Vij 
has all entries equal v > 0, i.e.: 

/ V + a V ■ ■ ■ V \ 



y V V ■ ■ ■ V + a J 

The real number a has been added in order for V to be 
definite positive. Of course this addition is purely formal 
since Zl{N) does not depend on a, as one can easily see 



from eq. (0). In fact no diagonal term Vn appears, as 
there are no self interaction diagrams. Even though the 
combinatorial problem in eq. is now greatly simplified, 
it still keeps a lot of its topological interest. In fact, by 
means of the matrix integral in eq. Q we can study the 
distribution of RNA structures with pseudoknots as a 
function of their topological character. Let us illustrate 
this point by a simple example for i = 4. In this case all 
possible contact structures are listed in Figure Q 
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FIG. 1: All possible arc diagrams with L = 4. Diagrams with 
i arcs are associated to the power and g is the genus. 

There is a total of ten possible arc diagrams, nine of 
which are planar and one which is not planar. The nine 
planar diagrams contain one diagram without arcs, six 
with one arc and two with two arcs. The same result can 
be directly obtained by computing the matrix integral in 
eq. JSJ. In fact, as we will show later in this Letter, the in- 
tegral evaluates precisely to Z4(Af) = l+Qv+2v'^+v'^ /N"^ . 
Thus the coefficients of the asymptotic series have a di- 
rect topological interpretation, and that is the reason 
why the asymptotic 1 /N''^ expansion is usually referred as 
topological expansion Each term of the series gives 
the number of diagrams with fixed topological character: 
the first term represents planar diagrams, the second rep- 
resents diagrams which can be drawn planarly on a sur- 
face with one handle (the torus), the third are diagrams 
that can be drawn planarly on a plane with two handles 
and so on. If we evaluate the integral in eq. JSJ for any 
finite L and finite N ^ we will have an analytical control 
over the topology and the combinatorics of eq. (0) . In the 
rest of the Letter, we will show explicitly how to compute 
the integral in eq. ((SJ- 

First, we note that by using a series of Hubbard- 
Stratanovich transformations, eq. Q can be exactly sim- 
plified to: 

^.W = ^/rf--|^-^ltr(l + .)^ (7) 

We see that the original integration over the L matrices 
(fk in eq. (|2Jl has been reduced to an integration over a 
single NxN matrix a. The similarity of eq. {Tj) and (|3Jl is 
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obvious, and will be demonstrated in a future publication 
Note that the regulator a drops out as long is it 
not zero. The normalization factor A{N) is: 
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The Gaussian matrix integral in eq. {Tj) is straightfor- 
ward. We introduce the spectral density of the matrix a 
at finite N: 
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By inserting the identity 1 = /J^^ dXp^iX) into eq. 0, 
we obtain: 
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Thus the multi-dimensional integral of eq. has been 
reduced to a one-dimensional integral. At this point it is 
convenient to study the exponential generating function 

°° j-L r+oo 

G{t,N) ^ Zl{N)- = / dApA,(A)e*(i+^) . (11) 

The explicit form of Pat (A) is a well known and classic 
result of Random Matrix Theory (se e e.g. j^^l [isj . 
and we use it in the form given in |l4|): 
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where Hk{x) are the Hermite polynomials: 
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By inserting eq. (|12|l into eq. Hll|) . one obtains 
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where we have used the formula: 
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The sum in eq. (|14|l can be expressed as a generalized 
Laguerre polynomial: 
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We finally obtain: 



Git,N) = e^->i^L, (-^) . (17) 

From this exact result we can extract informations on 
all the coefficients Zl{N). The series expansion in t of 
G{t,N) gives the first few coefficients Zl{N): 
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The meaning of these values is straightforward: the 
power of V is the number of arcs in the diagram, and 
the power of 1 /N'^ is the genus of the diagram. For in- 
stance when L = 7 there are 21 planar diagrams with one 
arc, and 35 diagrams on the torus (i.e. genus one closed 
oriented surface) with two arcs. The total number of di- 
agrams for each fixed genus can be obtained by putting 
V = I (for instance, the total number of diagrams on the 
torus for L = 6 is 25). Analogously, the total number of 
diagrams, irrespective of the genus, can be obtained by 
putting = 1 (for instance, the number of diagrams for 
L = 4 with 2 arcs is 3). 

The general topological expansion of Zl{N) with 
u = 1 is: 
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where the coefficients aL,g give exactly the number of di- 
agrams at fixed length L and fixed genus g. From formula 
(|17|l and eq. H18|l we recursively obtain all the coefficients 
a^^g. Moreover, by normalizing each a^^g by the total 
number of diagrams at fixed L, i.e. by A/" = ^l(I), we 
can obtain the distribution of the number of diagrams. 
In Figure |21 we plot the distributions of diagrams as a 
function of L and g. We note the interesting feature that 
for any given L >> 1 most of the diagrams are not pla- 
nar, and they have a genus close to a characteristic value 
{g)L- Such a value increases with L: we find numerically 
that it scales like {g)L ~ 0.23L, at large L. Also, for each 
fixed L there is a maximum possible value for g, namely 
.9 ^ L/4:. Conversely a structure can have a genus g only 
if it has a length at least L > Ag. 

It is important to note that even though the number 
of planar diagrams, aL^o, is exactly the number of sec- 
ondary structures without pseudoknots, and the number 
of diagrams on a torus, i.e. ul.i counts structures with 
one pseudoknot only, aL,g with g > 2 counts structures 
that contains either a single topologically complex pseu- 
doknot, or several simple pseudoknots with small genus. 
For that reason, the concept of irreducible pseudoknots 
has been introduced in |^ , and it would be of interest to 
study their distribution. The present analysis will be ex- 
tended to the case of irreducible pseudoknots in a future 
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FIG. 2: On the left: the normalized number of diagrams aL,glM at fixed t; as a function of L. On the right, the same quantity 
at fixed L as a function of g. In both cases we put v = \. 



publication, where we will also compute exact asymptotic 
behaviours for long sequences f[ll|). 

In this Letter, we have shown how one can compute the 
number of folded structures as a function of the length 
and of the genus of the RNA. This model is of course very 
schematic and oversimplified. It shows however that for 
a random RNA, the average topological character scales 
linearly with the length of the chain. As most wild RNA 
have an almost planar structure (with a genus g < 2), this 
implies that their sequences have been greatly designed 
by evolution in order to achieve this specificity. 
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